There appears to be an effect of frequency; below we explore whether to include it in the model.
Here we look for an effect of stress, given the apparent importance of frequency. After examining the data, we decided not to pursue stress as a linguistic factor, in part because of coding inconsistencies.
We briefly explored whether content vs. function words is a useful distinction to keep in the model, but ultimately decided to use word position instead (see below).
Based on the plot below, we might want to consider D separately when coding preceding environment.
Here we have collapsed preceding environment groups, but kept D separate as a result of the above visualization.
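A minimal sketch of this kind of recoding in base R. The segment labels, the inventories `obstruents` and `sonorants`, and the function `collapse_preceding` are hypothetical and only illustrate the idea; the categories obstruent, sonorant and vowel appear in the model output below, and D is assumed to be the remaining (reference) level.

```r
# Hypothetical segment inventories, for illustration only
obstruents <- c("P", "B", "T", "K", "G", "F", "V", "S", "Z")
sonorants  <- c("M", "N", "NG", "L", "R")

# Collapse a preceding segment into a broad class, keeping D as its own level.
# Anything not listed above is treated as a vowel (a simplifying assumption).
collapse_preceding <- function(seg) {
  out <- ifelse(seg == "D", "D",
         ifelse(seg %in% obstruents, "obstruent",
         ifelse(seg %in% sonorants, "sonorant", "vowel")))
  # Put D first so that it serves as the reference level in a model
  factor(out, levels = c("D", "obstruent", "sonorant", "vowel"))
}

recoded <- collapse_preceding(c("D", "T", "N", "AH"))
table(recoded)
```

Listing D first in `levels` makes it the baseline against which the other three classes are compared.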
Preceding obstruents promote stopping. This can be viewed as assimilation: obstruents are the strongest of the classes, and stopping a fricative is a fortition, so obstruents condition stops.
With D, on the other hand, there is less of this assimilation, because stopping would result in a sequence of two identical segments. The boundary between them is difficult to parse, so speakers are more likely to still produce a fricative.
We looked at vowels just in case, but the error bars on the most frequently stopped environments make this comparison uninformative.
As a proxy for lexical type (and even potentially frequency) we looked at whether word position might have an effect.
Based on model comparisons, we end up using word position instead of frequency. Word position results in a better model, and word position and frequency are largely collinear anyway.
We include word position as the predictor in the model (see the plot). We note, though, that there are frequency differences, and that frequency is significant when tested separately.
# Word-position model: random intercepts for speaker and word
pos_lm = lmer(binary_stop ~ word_pos + (1 | speaker.of.DH_data_concat) + (1 | edit_word),
              data = voc_data)
summary(pos_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ word_pos + (1 | speaker.of.DH_data_concat) + (1 |
## edit_word)
## Data: voc_data
##
## REML criterion at convergence: -3024.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.7076 -0.3502 -0.1937 0.0266 4.7956
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 0.003374 0.05809
## edit_word (Intercept) 0.000181 0.01345
## Residual 0.044517 0.21099
## Number of obs: 12403, groups: speaker.of.DH_data_concat, 189; edit_word, 31
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.02164 0.10712 9987.53119 0.202 0.840
## word_posinitial 0.05174 0.10713 9642.16226 0.483 0.629
## word_posmedial -0.01579 0.10716 9649.73305 -0.147 0.883
##
## Correlation of Fixed Effects:
## (Intr) wrd_psn
## word_posntl -0.998
## word_posmdl -0.998 0.998
Frequency Model
Fixed Effects:
- Frequency (Lg10WF)
Random Effects:
- Speaker
- Word
frq_lm = lmer(binary_stop ~ Lg10WF + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_data)
summary(frq_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ Lg10WF + (1 | speaker.of.DH_data_concat) + (1 |
## edit_word)
## Data: voc_data
##
## REML criterion at convergence: -2997.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.7190 -0.3500 -0.1868 0.0200 4.7863
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 0.0033797 0.05814
## edit_word (Intercept) 0.0007295 0.02701
## Residual 0.0445248 0.21101
## Number of obs: 12403, groups: speaker.of.DH_data_concat, 189; edit_word, 31
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) -0.09871 0.03403 33.55768 -2.901 0.006528 **
## Lg10WF 0.03093 0.00745 30.19704 4.152 0.000249 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## Lg10WF -0.978
ANOVA for Word Position and Frequency Models
## refitting model(s) with ML (instead of REML)
## Data: voc_data
## Models:
## frq_lm: binary_stop ~ Lg10WF + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
## pos_lm: binary_stop ~ word_pos + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## frq_lm 5 -3003.7 -2966.6 1506.8 -3013.7
## pos_lm 6 -3031.7 -2987.1 1521.8 -3043.7 29.976 1 4.374e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We chose word position because it results in the better model (lower AIC and BIC than the frequency model); frequency and word position are also largely collinear.
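Because the two models are not nested (each contains a predictor the other lacks), the comparison is probably most safely read from the information criteria rather than from the chi-square test. A minimal sketch in base R, using only the AIC and BIC values reported in the ANOVA table above; the data frame name `fit_stats` is introduced here for illustration.

```r
# Values copied from the anova() output for the binary_stop models
fit_stats <- data.frame(
  model = c("frq_lm", "pos_lm"),
  npar  = c(5, 6),
  AIC   = c(-3003.7, -3031.7),
  BIC   = c(-2966.6, -2987.1)
)

# Differences from the best (lowest) value of each criterion
fit_stats$dAIC <- fit_stats$AIC - min(fit_stats$AIC)
fit_stats$dBIC <- fit_stats$BIC - min(fit_stats$BIC)

fit_stats
best_model <- fit_stats$model[which.min(fit_stats$AIC)]
best_model
```

The model with a difference of 0 on both criteria is the preferred one; here that is the word-position model, which agrees with the choice made above.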
pos_preceding_lm = lmer(binary_stop ~ word_pos + preceding_cat + (1 | speaker.of.DH_data_concat) + (1 | edit_word),
data = voc_data)
summary(pos_preceding_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula:
## binary_stop ~ word_pos + preceding_cat + (1 | speaker.of.DH_data_concat) +
## (1 | edit_word)
## Data: voc_data
##
## REML criterion at convergence: -3517
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.0750 -0.4633 -0.1013 0.0831 4.8370
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 3.462e-03 0.058839
## edit_word (Intercept) 8.605e-05 0.009277
## Residual 4.270e-02 0.206642
## Number of obs: 12403, groups: speaker.of.DH_data_concat, 189; edit_word, 31
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 4.832e-02 1.048e-01 1.124e+04 0.461 0.644890
## word_posinitial 5.611e-03 1.046e-01 1.109e+04 0.054 0.957220
## word_posmedial -1.992e-02 1.045e-01 1.110e+04 -0.191 0.848842
## preceding_catobstruent 9.225e-02 7.418e-03 2.319e+03 12.437 < 2e-16 ***
## preceding_catsonorant -2.623e-02 7.588e-03 1.427e+03 -3.456 0.000564 ***
## preceding_catvowel -2.268e-02 8.219e-03 2.577e+03 -2.759 0.005830 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) wrd_psn wrd_psm prcdng_ctb prcdng_cts
## word_posntl -0.997
## word_posmdl -0.995 0.998
## prcdng_ctbs -0.052 0.001 0.001
## prcdng_ctsn -0.055 0.004 0.001 0.714
## prcdng_ctvw -0.080 0.035 0.004 0.651 0.669
We set frequency aside and do not include it in the model, because there is no variation in the low-frequency terms.
We now turn to just the Hispanic/Latinx subset of the data.
As we saw with the graph in the full dataset, there doesn’t appear to be an effect of birth year.
We looked at the effect of gender for just the Hispanic/Latinx population. There does not appear to be a gender effect.
We also looked more closely at education for just the Hispanic/Latinx group.
There is a significant difference between high school and college (note that this model produces a singular fit, with zero estimated variance for the word random intercept):
voc_hispanic <- voc_data %>%
filter(descent == "hispanic/latinx") %>%
filter(education_cont %in% c(1,3))
educ_hisp_lm = lmer(binary_stop ~ word_pos + preceding_cat + education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)
## boundary (singular) fit: see help('isSingular')
summary(educ_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ word_pos + preceding_cat + education_cont + (1 |
## speaker.of.DH_data_concat) + (1 | edit_word)
## Data: voc_hispanic
##
## REML criterion at convergence: -167.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.4182 -0.4917 -0.1202 0.0976 4.2883
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 0.003137 0.05601
## edit_word (Intercept) 0.000000 0.00000
## Residual 0.054279 0.23298
## Number of obs: 3954, groups: speaker.of.DH_data_concat, 61; edit_word, 30
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.13545 0.03161 80.46229 4.285 5.02e-05 ***
## word_posmedial -0.02923 0.01200 3928.10783 -2.435 0.0149 *
## preceding_catobstruent 0.12465 0.01478 3915.02736 8.436 < 2e-16 ***
## preceding_catsonorant -0.03181 0.01504 3914.88482 -2.115 0.0345 *
## preceding_catvowel -0.03894 0.01659 3915.09862 -2.347 0.0190 *
## education_cont -0.02451 0.01059 57.92026 -2.315 0.0242 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv
## word_posmdl 0.008
## prcdng_ctbs -0.348 -0.002
## prcdng_ctsn -0.348 -0.061 0.728
## prcdng_ctvw -0.316 -0.544 0.660 0.681
## educatn_cnt -0.887 -0.008 0.002 0.009 0.009
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
voc_white <- voc_data %>%
filter(descent == "white") %>%
filter(education_cont %in% c(2,3))
educ_white_lm = lmer(binary_stop ~ word_pos + preceding_cat + education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_white)
summary(educ_white_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ word_pos + preceding_cat + education_cont + (1 |
## speaker.of.DH_data_concat) + (1 | edit_word)
## Data: voc_white
##
## REML criterion at convergence: -1261.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.4179 -0.3995 -0.1234 0.0570 4.9564
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 0.0020868 0.04568
## edit_word (Intercept) 0.0002217 0.01489
## Residual 0.0399591 0.19990
## Number of obs: 3652, groups: speaker.of.DH_data_concat, 56; edit_word, 30
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) -5.469e-02 1.236e-01 2.280e+03 -0.442 0.6583
## word_posinitial 2.035e-02 1.175e-01 3.150e+03 0.173 0.8625
## word_posmedial -1.109e-02 1.173e-01 3.150e+03 -0.095 0.9247
## preceding_catobstruent 8.346e-02 1.290e-02 1.311e+03 6.470 1.38e-10 ***
## preceding_catsonorant -2.196e-02 1.302e-02 9.315e+02 -1.687 0.0920 .
## preceding_catvowel -5.788e-03 1.417e-02 1.419e+03 -0.409 0.6829
## education_cont 2.948e-02 1.434e-02 5.343e+01 2.055 0.0447 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) wrd_psn wrd_psm prcdng_ctb prcdng_cts prcdng_ctv
## word_posntl -0.947
## word_posmdl -0.943 0.994
## prcdng_ctbs -0.075 0.003 0.001
## prcdng_ctsn -0.080 0.008 0.002 0.706
## prcdng_ctvw -0.118 0.056 0.008 0.642 0.668
## educatn_cnt -0.296 -0.007 -0.007 -0.001 -0.006 -0.001
Because there are bilingual speakers of English/Spanish in the Latinx/Hispanic group, we start to look at the potential effect of bilingualism on dh-stopping.
There does seem to be an effect of bilingualism, but we now check how bilingualism interacts with education:
Final Model:
voc_hispanic <- voc_data %>%
filter(descent == "hispanic/latinx")
bilingual_hisp_lm = lmer(binary_stop ~ word_pos + preceding_cat + spanish_bilingual*education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)
summary(bilingual_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ word_pos + preceding_cat + spanish_bilingual *
## education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
## Data: voc_hispanic
##
## REML criterion at convergence: -541.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.9780 -0.5130 -0.1061 0.1051 4.4243
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 4.047e-03 0.063615
## edit_word (Intercept) 6.683e-05 0.008175
## Residual 5.095e-02 0.225728
## Number of obs: 5374, groups: speaker.of.DH_data_concat, 84; edit_word, 30
##
## Fixed effects:
## Estimate Std. Error df t value
## (Intercept) -0.05561 0.07076 82.22597 -0.786
## word_posmedial -0.02436 0.01060 138.76869 -2.299
## preceding_catobstruent 0.10952 0.01226 1355.24155 8.932
## preceding_catsonorant -0.02927 0.01257 866.74651 -2.328
## preceding_catvowel -0.03995 0.01376 1719.94883 -2.904
## spanish_bilingualyes 0.19768 0.07538 78.95706 2.623
## education_cont 0.03389 0.02231 78.84693 1.519
## spanish_bilingualyes:education_cont -0.05839 0.02444 78.90459 -2.389
## Pr(>|t|)
## (Intercept) 0.43419
## word_posmedial 0.02302 *
## preceding_catobstruent < 2e-16 ***
## preceding_catsonorant 0.02012 *
## preceding_catvowel 0.00373 **
## spanish_bilingualyes 0.01047 *
## education_cont 0.13279
## spanish_bilingualyes:education_cont 0.01929 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv spnsh_ edctn_
## word_posmdl -0.007
## prcdng_ctbs -0.128 -0.003
## prcdng_ctsn -0.124 -0.056 0.722
## prcdng_ctvw -0.107 -0.514 0.654 0.674
## spnsh_blngl -0.919 0.002 0.002 -0.003 -0.008
## educatn_cnt -0.970 -0.002 0.003 0.001 -0.004 0.911
## spnsh_bln:_ 0.886 -0.002 -0.006 -0.001 0.005 -0.974 -0.913
An interesting pattern emerges: bilinguals with only a high-school education show the highest proportion of dh-stopping, whereas among monolingual speakers dh-stopping numerically increases with more, not less, education.
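This pattern can be read off the fixed-effect estimates of the model above. A minimal sketch in base R, under two assumptions: that `education_cont` is coded 1 = high school, 2 = some college, 3 = college (inferred from the subsetting used elsewhere in this document), and that the other predictors are held at their reference levels. The object names `b`, `education`, `mono` and `bili` are introduced here for illustration.

```r
# Estimates copied from summary(bilingual_hisp_lm) for binary_stop
b <- c(intercept   = -0.05561,
       bilingual   =  0.19768,
       education   =  0.03389,
       interaction = -0.05839)

education <- 1:3  # assumed coding: 1 = HS, 2 = some college, 3 = college

# Predicted stop proportion for monolinguals (spanish_bilingual = no)
mono <- b[["intercept"]] + b[["education"]] * education

# For bilinguals the education slope is the sum of the main effect and the interaction
bili <- b[["intercept"]] + b[["bilingual"]] +
  (b[["education"]] + b[["interaction"]]) * education

round(rbind(monolingual = mono, bilingual = bili), 3)
```

The bilingual slope (0.03389 - 0.05839) is negative while the monolingual slope is positive, which is the crossover described above. The monolingual value at education level 1 comes out slightly below zero, which can happen when a 0/1 outcome is fit with a linear model.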
Originally we had coded mixed descent, but for ambiguous coding reasons we decided not to explore this any further.
We look here at the field sites, but only the Hispanic/Latinx speakers.
Note: this section still needs to be tidied up.
voc_data$site = factor(voc_data$site, ordered = FALSE)
voc_data$site = relevel(voc_data$site, "SAC")
voc_hispanic <- voc_data %>%
filter(site %in% fieldsub) %>%
filter(descent == "hispanic/latinx")
bilingual_hisp_lm = lmer(binary_stop ~ word_pos + preceding_cat + spanish_bilingual*education_cont + site + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)
summary(bilingual_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ word_pos + preceding_cat + spanish_bilingual *
## education_cont + site + (1 | speaker.of.DH_data_concat) +
## (1 | edit_word)
## Data: voc_hispanic
##
## REML criterion at convergence: -424.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.0058 -0.5160 -0.1116 0.1138 4.4466
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 0.0040973 0.06401
## edit_word (Intercept) 0.0001053 0.01026
## Residual 0.0512832 0.22646
## Number of obs: 4691, groups: speaker.of.DH_data_concat, 74; edit_word, 29
##
## Fixed effects:
## Estimate Std. Error df t value
## (Intercept) -0.09841 0.07613 70.14537 -1.293
## word_posmedial -0.02574 0.01164 134.03503 -2.212
## preceding_catobstruent 0.11078 0.01323 1337.86948 8.374
## preceding_catsonorant -0.03045 0.01352 926.37066 -2.252
## preceding_catvowel -0.04087 0.01482 1696.68959 -2.757
## spanish_bilingualyes 0.18212 0.07922 67.55428 2.299
## education_cont 0.03683 0.02356 67.37238 1.563
## siteBAK 0.04583 0.02649 67.25072 1.730
## siteSAL 0.06609 0.02573 67.26394 2.568
## spanish_bilingualyes:education_cont -0.05971 0.02575 67.45179 -2.319
## Pr(>|t|)
## (Intercept) 0.20036
## word_posmedial 0.02863 *
## preceding_catobstruent < 2e-16 ***
## preceding_catsonorant 0.02458 *
## preceding_catvowel 0.00589 **
## spanish_bilingualyes 0.02460 *
## education_cont 0.12269
## siteBAK 0.08825 .
## siteSAL 0.01244 *
## spanish_bilingualyes:education_cont 0.02343 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv spnsh_ edctn_ sitBAK
## word_posmdl -0.010
## prcdng_ctbs -0.126 -0.004
## prcdng_ctsn -0.125 -0.054 0.726
## prcdng_ctvw -0.108 -0.505 0.656 0.675
## spnsh_blngl -0.874 0.000 0.001 -0.005 -0.007
## educatn_cnt -0.937 -0.003 0.003 0.001 -0.001 0.908
## siteBAK -0.225 0.009 -0.005 0.005 -0.004 -0.070 -0.010
## siteSAL -0.182 0.007 -0.006 0.003 -0.003 -0.090 -0.046 0.750
## spnsh_bln:_ 0.849 0.000 -0.005 0.001 0.004 -0.970 -0.912 0.052
## sitSAL
## word_posmdl
## prcdng_ctbs
## prcdng_ctsn
## prcdng_ctvw
## spnsh_blngl
## educatn_cnt
## siteBAK
## siteSAL
## spnsh_bln:_ 0.035
Latinx
voc_data %>%
filter(descent == "hispanic/latinx") %>%
filter(site %in% fieldsub) %>%
group_by(spanish_bilingual,education_cont,site) %>%
summarize(prop = mean(binary_stop),
count = n(),
CI.Low = ci.low(binary_stop),
CI.High = ci.high(binary_stop),
YMin = prop - CI.Low,
YMax = prop + CI.High) %>%
ggplot(aes(x=education_cont,y=prop,alpha=count)) +
geom_bar(stat="identity", position="dodge") +
geom_errorbar(aes(ymin = YMin, ymax=YMax),
position = "dodge",
width=0.25) +
theme_minimal() +
labs(x="Education", y="Proportion of stop realizations", alpha = "Token Count") +
facet_wrap(spanish_bilingual~site)
## `summarise()` has grouped output by 'spanish_bilingual', 'education_cont'. You
## can override using the `.groups` argument.
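The helpers `ci.low()` and `ci.high()` used in the plotting code are not defined in this section. A minimal sketch, assuming they return the distance from the sample mean to the 2.5% and 97.5% percentile bootstrap bounds (which is what `prop - CI.Low` and `prop + CI.High` imply). The argument `n_boot`, the internal helper `boot_means`, and the simulated vector `stops` are hypothetical.

```r
# Bootstrap the mean of x: resample with replacement n_boot times
boot_means <- function(x, n_boot = 1000) {
  replicate(n_boot, mean(sample(x, length(x), replace = TRUE)))
}

# Distance from the sample mean down to the 2.5% bootstrap quantile
ci.low <- function(x, n_boot = 1000) {
  mean(x) - unname(quantile(boot_means(x, n_boot), 0.025))
}

# Distance from the sample mean up to the 97.5% bootstrap quantile
ci.high <- function(x, n_boot = 1000) {
  unname(quantile(boot_means(x, n_boot), 0.975)) - mean(x)
}

set.seed(1)
stops <- rbinom(200, size = 1, prob = 0.1)  # simulated 0/1 outcomes, not real data
prop  <- mean(stops)
y_min <- prop - ci.low(stops)
y_max <- prop + ci.high(stops)
c(YMin = y_min, prop = prop, YMax = y_max)
```

Because both helpers return distances rather than bounds, the error-bar limits are obtained by subtracting and adding them to the proportion, as the `summarize()` call does.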
BAKSAL <- c("BAK","SAL")
voc_data %>%
filter(descent == "hispanic/latinx") %>%
filter(site %in% BAKSAL) %>%
group_by(education_cont,site) %>%
summarize(prop = mean(binary_stop),
count = n(),
CI.Low = ci.low(binary_stop),
CI.High = ci.high(binary_stop),
YMin = prop - CI.Low,
YMax = prop + CI.High) %>%
ggplot(aes(x=education_cont,y=prop,alpha=count)) +
geom_bar(stat="identity", position="dodge") +
geom_errorbar(aes(ymin = YMin, ymax=YMax),
position = "dodge",
width=0.25) +
theme_minimal() +
labs(x="Education", y="Proportion of stop realizations", alpha = "Token Count") +
facet_wrap(~site)
## `summarise()` has grouped output by 'education_cont'. You can override using
## the `.groups` argument.
Bakersfield (BAK) has a mix of monolinguals and bilinguals, whereas the SAL results are driven by bilinguals.
voc_data %>%
filter(site %in% fieldsub) %>%
group_by(site,spanish_bilingual) %>%
summarize(participants = n_distinct(speaker.of.DH_data_concat))
## `summarise()` has grouped output by 'site'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 3
## # Groups: site [3]
## site spanish_bilingual participants
## <fct> <chr> <int>
## 1 SAC no 42
## 2 SAC yes 5
## 3 BAK no 37
## 4 BAK yes 17
## 5 SAL no 8
## 6 SAL yes 31
voc_data$site = factor(voc_data$site, ordered = FALSE)
voc_data$site = relevel(voc_data$site, "SAL")
voc_hispanic <- voc_data %>%
filter(site %in% BAKSAL) %>%
filter(descent == "hispanic/latinx")
bilingual_hisp_lm = lmer(binary_stop ~ word_pos + preceding_cat + spanish_bilingual + site + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)
summary(bilingual_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ word_pos + preceding_cat + spanish_bilingual +
## site + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
## Data: voc_hispanic
##
## REML criterion at convergence: 86.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.9531 -0.5204 -0.1274 0.1154 4.1152
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 0.0051335 0.07165
## edit_word (Intercept) 0.0001718 0.01311
## Residual 0.0573700 0.23952
## Number of obs: 4079, groups: speaker.of.DH_data_concat, 64; edit_word, 29
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.08042 0.02596 101.81583 3.098 0.00252 **
## word_posmedial -0.03042 0.01358 121.47691 -2.241 0.02687 *
## preceding_catobstruent 0.11949 0.01515 1160.52527 7.886 7.13e-15 ***
## preceding_catsonorant -0.03821 0.01559 833.16416 -2.451 0.01443 *
## preceding_catvowel -0.04775 0.01703 1540.71470 -2.804 0.00510 **
## spanish_bilingualyes 0.01110 0.02293 60.60032 0.484 0.63028
## siteBAK -0.01461 0.02021 60.49472 -0.723 0.47248
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv spnsh_
## word_posmdl -0.050
## prcdng_ctbs -0.421 -0.005
## prcdng_ctsn -0.412 -0.046 0.732
## prcdng_ctvw -0.372 -0.496 0.662 0.674
## spnsh_blngl -0.713 0.008 -0.018 -0.023 -0.014
## siteBAK -0.440 0.006 0.002 0.001 -0.002 0.187
voc_data %>%
filter(descent == "hispanic/latinx") %>%
filter(site == "SAL") %>%
group_by(education_cont) %>%
summarize(prop = mean(binary_stop),
count = n(),
CI.Low = ci.low(binary_stop),
CI.High = ci.high(binary_stop),
YMin = prop - CI.Low,
YMax = prop + CI.High) %>%
ggplot(aes(x=education_cont,y=prop,alpha=count)) +
geom_bar(stat="identity", position="dodge") +
geom_errorbar(aes(ymin = YMin, ymax=YMax),
position = "dodge",
width=0.25) +
theme_minimal() +
labs(x="Education Value", y="Proportion of stop realizations", alpha = "Token Count") +
scale_alpha(range = c(.3,.9))
We are keeping the models without mixed descent: we do not look at mixed descent because its meaning for Latinx speakers is ambiguous.
ToDo:
Important Findings:
- MODEL 1:
  - Latinx speakers have more stopping
- MODEL 2:
  - Education only main effect for Latinx speakers, between HS and College
  - Sacramento has a low rate among Latinx speakers
  - All fieldsites except SAC have pretty similar rates of stopping
- MODEL 2B (WHITE PPL MODEL):
  - Education effect for white speakers between some college and finished college (overall uninterpretable pattern)
- MODEL 3:
  - Not replicating finding of bilingualism present in whole dataset when we subset to just BAK and SAL
voc_data %>%
group_by(realization) %>%
summarize(count = n())
## # A tibble: 4 × 2
## realization count
## <chr> <int>
## 1 deleted 1964
## 2 fricative 8612
## 3 intermediate 1185
## 4 stop 642
TODO (Bran): fix the reference level here.
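One possible way to handle the reference level is to make the most frequent word position the baseline with base R's `relevel()`. A minimal sketch on a toy factor; the level name "final" and the toy values are assumptions standing in for `voc_data$word_pos`, whose actual levels are not shown in this section.

```r
# Toy factor standing in for voc_data$word_pos (level "final" is an assumed name)
word_pos <- factor(c("initial", "medial", "initial", "final", "medial", "initial"))

# By default levels are alphabetical, so the rare level "final" is the baseline
levels(word_pos)

# Make the most frequent level the reference instead
most_frequent <- names(which.max(table(word_pos)))
word_pos <- relevel(word_pos, ref = most_frequent)

levels(word_pos)
```

A sparsely populated baseline would be consistent with the large standard errors and the near -1 correlations between the intercept and the word-position terms in the model summaries; a well-populated baseline usually makes those estimates easier to interpret.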
pos_lm = lmer(binned_binary_stop ~ word_pos + (1 | speaker.of.DH_data_concat) + (1 | edit_word),
data = voc_data)
summary(pos_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ word_pos + (1 | speaker.of.DH_data_concat) +
## (1 | edit_word)
## Data: voc_data
##
## REML criterion at convergence: 8175.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.92700 -0.56627 -0.30792 0.04076 3.14480
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 0.010117 0.10058
## edit_word (Intercept) 0.001306 0.03613
## Residual 0.109403 0.33076
## Number of obs: 12403, groups: speaker.of.DH_data_concat, 189; edit_word, 31
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 3.288e-02 1.705e-01 4.939e+03 0.193 0.847
## word_posinitial 1.567e-01 1.707e-01 4.498e+03 0.918 0.359
## word_posmedial 1.765e-02 1.708e-01 4.546e+03 0.103 0.918
##
## Correlation of Fixed Effects:
## (Intr) wrd_psn
## word_posntl -0.997
## word_posmdl -0.997 0.996
frq_lm = lmer(binned_binary_stop ~ Lg10WF + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_data)
summary(frq_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ Lg10WF + (1 | speaker.of.DH_data_concat) +
## (1 | edit_word)
## Data: voc_data
##
## REML criterion at convergence: 8195.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.9342 -0.5639 -0.3033 0.0357 3.1648
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 0.010121 0.10060
## edit_word (Intercept) 0.003731 0.06108
## Residual 0.109380 0.33073
## Number of obs: 12403, groups: speaker.of.DH_data_concat, 189; edit_word, 31
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) -0.18028 0.07166 31.53935 -2.516 0.017178 *
## Lg10WF 0.06733 0.01588 29.30850 4.241 0.000204 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## Lg10WF -0.980
ANOVA for Word Position and Frequency Models
## refitting model(s) with ML (instead of REML)
## Data: voc_data
## Models:
## frq_lm: binned_binary_stop ~ Lg10WF + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
## pos_lm: binned_binary_stop ~ word_pos + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## frq_lm 5 8192.7 8229.8 -4091.3 8182.7
## pos_lm 6 8171.7 8216.2 -4079.8 8159.7 22.999 1 1.621e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pos_preceding_lm = lmer(binned_binary_stop ~ word_pos + preceding_cat + (1 | speaker.of.DH_data_concat) + (1 | edit_word),
data = voc_data)
summary(pos_preceding_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula:
## binned_binary_stop ~ word_pos + preceding_cat + (1 | speaker.of.DH_data_concat) +
## (1 | edit_word)
## Data: voc_data
##
## REML criterion at convergence: 7484.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.2011 -0.5610 -0.2281 0.0685 3.2510
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 0.0102649 0.10132
## edit_word (Intercept) 0.0008984 0.02997
## Residual 0.1032691 0.32136
## Number of obs: 12403, groups: speaker.of.DH_data_concat, 189; edit_word, 31
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.06671 0.16524 6121.43421 0.404 0.6864
## word_posinitial 0.08435 0.16497 5644.30228 0.511 0.6091
## word_posmedial 0.01107 0.16482 5687.37189 0.067 0.9465
## preceding_catobstruent 0.17255 0.01192 5968.36530 14.470 < 2e-16 ***
## preceding_catsonorant -0.05153 0.01234 3854.24265 -4.175 3.05e-05 ***
## preceding_catvowel -0.02587 0.01322 5164.58738 -1.957 0.0504 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) wrd_psn wrd_psm prcdng_ctb prcdng_cts
## word_posntl -0.996
## word_posmdl -0.994 0.996
## prcdng_ctbs -0.054 0.002 0.001
## prcdng_ctsn -0.056 0.003 0.000 0.726
## prcdng_ctvw -0.081 0.035 0.005 0.665 0.668
We set frequency aside and do not include it in the model, because there is no variation in the low-frequency terms.
To look at!
With the binned outcome, the difference between high school and college is not significant (education_cont: p = 0.137):
voc_hispanic <- voc_data %>%
filter(descent == "hispanic/latinx") %>%
filter(education_cont %in% c(1,3))
educ_hisp_lm = lmer(binned_binary_stop ~ word_pos + preceding_cat + education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)
summary(educ_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ word_pos + preceding_cat + education_cont +
## (1 | speaker.of.DH_data_concat) + (1 | edit_word)
## Data: voc_hispanic
##
## REML criterion at convergence: 3063.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.93785 -0.58209 -0.25503 0.08146 2.97118
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 0.0103260 0.10162
## edit_word (Intercept) 0.0008229 0.02869
## Residual 0.1220237 0.34932
## Number of obs: 3954, groups: speaker.of.DH_data_concat, 61; edit_word, 30
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.27355 0.05522 77.88480 4.954 4.14e-06 ***
## word_posmedial -0.08700 0.02172 77.91077 -4.005 0.000141 ***
## preceding_catobstruent 0.19944 0.02330 1538.80420 8.560 < 2e-16 ***
## preceding_catsonorant -0.05679 0.02402 1116.58293 -2.365 0.018223 *
## preceding_catvowel -0.05593 0.02595 1770.61216 -2.155 0.031306 *
## education_cont -0.02792 0.01852 57.02897 -1.507 0.137305
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv
## word_posmdl -0.052
## prcdng_ctbs -0.320 -0.009
## prcdng_ctsn -0.321 -0.049 0.748
## prcdng_ctvw -0.290 -0.443 0.682 0.692
## educatn_cnt -0.888 -0.007 0.002 0.008 0.008
voc_white <- voc_data %>%
filter(descent == "white") %>%
filter(education_cont %in% c(2,3))
educ_white_lm = lmer(binned_binary_stop ~ word_pos + preceding_cat + education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_white)
summary(educ_white_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ word_pos + preceding_cat + education_cont +
## (1 | speaker.of.DH_data_concat) + (1 | edit_word)
## Data: voc_white
##
## REML criterion at convergence: 1794.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.5222 -0.5275 -0.2275 0.0390 3.2811
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 0.005563 0.07458
## edit_word (Intercept) 0.001261 0.03551
## Residual 0.091991 0.30330
## Number of obs: 3652, groups: speaker.of.DH_data_concat, 56; edit_word, 30
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) -0.09907 0.19076 1840.92025 -0.519 0.6036
## word_posinitial 0.11549 0.18060 2278.09223 0.639 0.5226
## word_posmedial 0.03756 0.18028 2281.14951 0.208 0.8350
## preceding_catobstruent 0.16155 0.02007 1864.67149 8.049 1.48e-15 ***
## preceding_catsonorant -0.04658 0.02040 1382.42445 -2.284 0.0225 *
## preceding_catvowel 0.00015 0.02203 1842.64148 0.007 0.9946
## education_cont 0.03812 0.02304 54.84947 1.654 0.1038
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) wrd_psn wrd_psm prcdng_ctb prcdng_cts prcdng_ctv
## word_posntl -0.942
## word_posmdl -0.938 0.993
## prcdng_ctbs -0.077 0.003 0.001
## prcdng_ctsn -0.081 0.008 0.002 0.716
## prcdng_ctvw -0.119 0.056 0.009 0.653 0.673
## educatn_cnt -0.309 -0.006 -0.006 -0.002 -0.006 -0.002
Final Model:
voc_hispanic <- voc_data %>%
filter(descent == "hispanic/latinx")
bilingual_hisp_lm = lmer(binned_binary_stop ~ word_pos + preceding_cat + spanish_bilingual*education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)
summary(bilingual_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ word_pos + preceding_cat + spanish_bilingual *
## education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
## Data: voc_hispanic
##
## REML criterion at convergence: 4075.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1396 -0.5876 -0.2530 0.1164 3.0978
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 0.0132811 0.11524
## edit_word (Intercept) 0.0007517 0.02742
## Residual 0.1196839 0.34595
## Number of obs: 5374, groups: speaker.of.DH_data_concat, 84; edit_word, 30
##
## Fixed effects:
## Estimate Std. Error df t value
## (Intercept) 0.24732 0.12495 81.40889 1.979
## word_posmedial -0.07929 0.01910 65.60834 -4.152
## preceding_catobstruent 0.19130 0.01946 1976.36198 9.833
## preceding_catsonorant -0.03999 0.02016 1319.58586 -1.984
## preceding_catvowel -0.04854 0.02173 2182.80497 -2.233
## spanish_bilingualyes 0.03235 0.13314 78.26178 0.243
## education_cont -0.02002 0.03941 78.16766 -0.508
## spanish_bilingualyes:education_cont -0.01360 0.04317 78.22478 -0.315
## Pr(>|t|)
## (Intercept) 0.0511 .
## word_posmedial 9.72e-05 ***
## preceding_catobstruent < 2e-16 ***
## preceding_catsonorant 0.0475 *
## preceding_catvowel 0.0256 *
## spanish_bilingualyes 0.8087
## education_cont 0.6128
## spanish_bilingualyes:education_cont 0.7537
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv spnsh_ edctn_
## word_posmdl -0.025
## prcdng_ctbs -0.117 -0.007
## prcdng_ctsn -0.114 -0.045 0.736
## prcdng_ctvw -0.098 -0.431 0.670 0.679
## spnsh_blngl -0.920 0.001 0.002 -0.001 -0.006
## educatn_cnt -0.971 -0.002 0.003 0.002 -0.003 0.911
## spnsh_bln:_ 0.886 -0.001 -0.005 -0.002 0.004 -0.974 -0.913
voc_data$site = factor(voc_data$site, ordered = FALSE)
voc_data$site = relevel(voc_data$site, "SAC")
voc_hispanic <- voc_data %>%
filter(site %in% fieldsub) %>%
filter(descent == "hispanic/latinx")
bilingual_hisp_lm = lmer(binned_binary_stop ~ word_pos + preceding_cat + spanish_bilingual*education_cont + site + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)
summary(bilingual_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ word_pos + preceding_cat + spanish_bilingual *
## education_cont + site + (1 | speaker.of.DH_data_concat) +
## (1 | edit_word)
## Data: voc_hispanic
##
## REML criterion at convergence: 3503
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1717 -0.6017 -0.2459 0.1503 3.1211
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 0.0125872 0.11219
## edit_word (Intercept) 0.0008417 0.02901
## Residual 0.1179785 0.34348
## Number of obs: 4691, groups: speaker.of.DH_data_concat, 74; edit_word, 29
##
## Fixed effects:
## Estimate Std. Error df t value
## (Intercept) 1.438e-01 1.305e-01 6.915e+01 1.102
## word_posmedial -8.024e-02 2.031e-02 6.558e+01 -3.950
## preceding_catobstruent 1.968e-01 2.067e-02 1.817e+03 9.517
## preceding_catsonorant -3.413e-02 2.130e-02 1.283e+03 -1.602
## preceding_catvowel -5.331e-02 2.308e-02 2.040e+03 -2.310
## spanish_bilingualyes -5.414e-03 1.358e-01 6.663e+01 -0.040
## education_cont -1.449e-02 4.040e-02 6.647e+01 -0.359
## siteSAL 1.452e-01 4.413e-02 6.645e+01 3.291
## siteBAK 1.114e-01 4.543e-02 6.643e+01 2.451
## spanish_bilingualyes:education_cont -1.279e-02 4.414e-02 6.655e+01 -0.290
## Pr(>|t|)
## (Intercept) 0.274200
## word_posmedial 0.000193 ***
## preceding_catobstruent < 2e-16 ***
## preceding_catsonorant 0.109301
## preceding_catvowel 0.020986 *
## spanish_bilingualyes 0.968320
## education_cont 0.720866
## siteSAL 0.001600 **
## siteBAK 0.016874 *
## spanish_bilingualyes:education_cont 0.772897
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv spnsh_ edctn_ sitSAL
## word_posmdl -0.026
## prcdng_ctbs -0.116 -0.007
## prcdng_ctsn -0.116 -0.045 0.738
## prcdng_ctvw -0.100 -0.433 0.669 0.679
## spnsh_blngl -0.875 -0.001 0.001 -0.003 -0.006
## educatn_cnt -0.937 -0.003 0.003 0.002 -0.002 0.908
## siteSAL -0.182 0.006 -0.006 0.002 -0.003 -0.089 -0.045
## siteBAK -0.225 0.007 -0.005 0.003 -0.003 -0.070 -0.010 0.750
## spnsh_bln:_ 0.849 0.001 -0.005 0.000 0.004 -0.970 -0.912 0.034
## sitBAK
## word_posmdl
## prcdng_ctbs
## prcdng_ctsn
## prcdng_ctvw
## spnsh_blngl
## educatn_cnt
## siteSAL
## siteBAK
## spnsh_bln:_ 0.052
Bakersfield has a mix of monolingual and bilingual speakers, while the SAL results are driven by bilinguals.
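# Make SAL the reference level so that siteBAK is estimated relative to SAL
voc_data$site = factor(voc_data$site, ordered = FALSE)
voc_data$site = relevel(voc_data$site, "SAL")
# Keep only Hispanic/Latinx speakers from the sites listed in BAKSAL (defined elsewhere)
voc_hispanic <- voc_data %>%
  filter(site %in% BAKSAL) %>%
  filter(descent == "hispanic/latinx")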
bilingual_hisp_lm = lmer(binned_binary_stop ~ word_pos + preceding_cat + spanish_bilingual + site + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)
summary(bilingual_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ word_pos + preceding_cat + spanish_bilingual +
## site + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
## Data: voc_hispanic
##
## REML criterion at convergence: 3291.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1376 -0.5894 -0.2477 0.1547 3.0383
##
## Random effects:
## Groups Name Variance Std.Dev.
## speaker.of.DH_data_concat (Intercept) 0.013893 0.11787
## edit_word (Intercept) 0.001193 0.03454
## Residual 0.125345 0.35404
## Number of obs: 4079, groups: speaker.of.DH_data_concat, 64; edit_word, 29
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.24657 0.04230 100.38812 5.829 6.78e-08 ***
## word_posmedial -0.08757 0.02322 65.04878 -3.772 0.000353 ***
## preceding_catobstruent 0.21369 0.02310 1748.21599 9.249 < 2e-16 ***
## preceding_catsonorant -0.03301 0.02395 1291.88573 -1.378 0.168360
## preceding_catvowel -0.05630 0.02584 2037.85154 -2.179 0.029429 *
## spanish_bilingualyes -0.04197 0.03716 59.68009 -1.129 0.263236
## siteBAK -0.02645 0.03274 59.55654 -0.808 0.422519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv spnsh_
## word_posmdl -0.107
## prcdng_ctbs -0.399 -0.008
## prcdng_ctsn -0.391 -0.038 0.745
## prcdng_ctvw -0.353 -0.422 0.676 0.680
## spnsh_blngl -0.710 0.006 -0.017 -0.021 -0.013
## siteBAK -0.437 0.004 0.002 0.000 -0.002 0.186
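The site coefficients in the models above are treatment contrasts against whichever level relevel() puts first. As a minimal sketch of what that means, here is a toy example using base lm() with no random effects. The data values are hypothetical and are not taken from voc_data; only the column and level names mirror the ones used above.

```r
# Toy data (hypothetical values): four tokens per site.
toy <- data.frame(
  site = factor(rep(c("SAC", "SAL", "BAK"), each = 4)),
  binned_binary_stop = c(0, 0, 0, 1,   # SAC mean 0.25
                         1, 1, 0, 1,   # SAL mean 0.75
                         1, 0, 0, 1)   # BAK mean 0.50
)

# With SAC as reference, each site coefficient is that site's mean minus SAC's mean.
toy$site <- relevel(toy$site, "SAC")
coef_sac <- coef(lm(binned_binary_stop ~ site, data = toy))

# With SAL as reference, siteBAK now reports BAK minus SAL instead.
toy$site <- relevel(toy$site, "SAL")
coef_sal <- coef(lm(binned_binary_stop ~ site, data = toy))

print(coef_sac)
print(coef_sal)
```

This is why siteBAK can be positive in the model with SAC as the reference and negative in the model with SAL as the reference: the two coefficients answer different comparisons.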
Things to do next time:
Investigate Gender
Build models for the intermediate realizations
Gender x Bilingualism ?
Model 4 (Salinas’ Version)
Social Factors
Race and Ethnicity
We began by looking at whether the proportion of dh-stopping differed by race/ethnicity.
Hispanic and Latinx speakers have a higher proportion of stop realizations than do white or Black speakers.
There is evidence that Hispanic speakers stop more than white speakers, but no evidence that they stop more than Black speakers, likely because of low token counts and a relative lack of variation among the Black speakers.
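To illustrate why a low token count makes a group hard to compare, here is a minimal sketch that computes per-group stop proportions with exact binomial intervals. The data are hypothetical toy counts, not the study's; the descent labels "white" and "black" and the helper summarise_group are assumptions made for the example.

```r
# Toy data (hypothetical counts): one row per token.
toy <- data.frame(
  descent = rep(c("hispanic/latinx", "white", "black"), times = c(200, 150, 20)),
  binary_stop = c(rep(c(1, 0), times = c(60, 140)),
                  rep(c(1, 0), times = c(15, 135)),
                  rep(c(1, 0), times = c(4, 16)))
)

# Stop count, token count, proportion, and exact 95% binomial interval for one group.
summarise_group <- function(x) {
  ci <- binom.test(sum(x), length(x))$conf.int
  c(stops = sum(x), tokens = length(x), prop = mean(x),
    lower = ci[1], upper = ci[2])
}

by_descent <- do.call(
  rbind,
  lapply(split(toy$binary_stop, toy$descent), summarise_group)
)
print(round(by_descent, 3))
```

In this toy table the smallest group has by far the widest interval, so its proportion overlaps with both of the larger groups even though its point estimate sits between them.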
Birth Year
We looked at birth year for all groups:
Race and Birth Year
We also investigated whether or not the three racial categories showed changes over the last ~60 years.
It appears in this dataset that birth year is not a significant predictor of stop proportions.
Field Site
We also examined how dh-stopping proportions vary across the field sites in our dataset. We later focus in on Salinas, Bakersfield, and Sacramento because of relatively lower token counts in other locations.
Field Site by Race/Ethnicity
We also looked at the interaction between race and field site. Note that some field sites have groups that do not use stopping at all (e.g. only white speakers in Redlands show stopping behavior).
Bakersfield is the only community in which we see significant amounts of stopping for all ethnic groups. We don’t want to interpret the very high rate of stopping in HUM because it rests on a low token count. We can be reasonably confident in the rate of stopping among Hispanic speakers in SAL because that intersection of site and race is the best represented in the data.
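A minimal sketch of how the token counts behind each site-by-ethnicity cell could be checked before reading the proportions. The data are hypothetical, and the min_tokens cutoff of 30 is an arbitrary assumption for illustration, not a threshold used in this analysis.

```r
# Toy data (hypothetical): one row per token, every fourth token is a stop.
toy <- data.frame(
  site = rep(c("SAL", "BAK", "HUM"), times = c(120, 90, 10)),
  descent = c(rep(c("hispanic/latinx", "white"), times = c(100, 20)),
              rep(c("hispanic/latinx", "white"), times = c(50, 40)),
              rep(c("hispanic/latinx", "white"), times = c(4, 6))),
  binary_stop = rep(c(1, 0, 0, 0), length.out = 220)
)

# Token counts and stop proportions for each site x descent cell.
tokens <- table(toy$site, toy$descent)
props <- tapply(toy$binary_stop, list(toy$site, toy$descent), mean)

# Blank out proportions that rest on too few tokens (assumed cutoff).
min_tokens <- 30
props[tokens < min_tokens] <- NA

print(tokens)
print(round(props, 2))
```

Cells that come back as NA are the ones whose proportions should not be interpreted on their own.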
Gender
Next we turn to gender (operationalized as a binary variable).
There is no effect of gender for the Hispanic population, but there is definitely one for white speakers and probably (?) one for Black speakers. Based on the model below, we don’t keep gender in the final model.
Model
Education
There is probably not enough data in levels 1, 2, and 4 to draw any firm conclusions about a main effect of education.
Education and Race/Ethnicity
Just in case, we next look at education crossed with race/ethnicity.
There seems to be a pattern for Hispanic/Latinx speakers such that those who only completed high school show the highest proportion of dh-stopping. For Black speakers, there is probably not enough data to find a pattern, and the results are unclear for white speakers.
Model
Numerically, the group with the highest rate of stopping is Hispanic HS. This group differs significantly from college students, but we don’t have enough data to say whether it differs significantly from some college/AS speakers. We established this with a by-hand ordinal analysis.